% File src/library/base/man/Encoding.Rd
% Part of the R package, https://www.R-project.org
% Copyright 1995-2019 R Core Team
% Distributed under GPL 2 or later

\name{Encoding}
\alias{Encoding}
\alias{Encoding<-}
\alias{enc2native}
\alias{enc2utf8}
\concept{encoding}
\title{Read or Set the Declared Encodings for a Character Vector}
\description{
  Read or set the declared encodings for a character vector.
}
\usage{
Encoding(x)

Encoding(x) <- value

enc2native(x)
enc2utf8(x)
}
\arguments{
  \item{x}{A character vector.}
  \item{value}{A character vector of positive length.}
}
\details{
  Character strings in \R can be declared to be encoded in
  \code{"latin1"} or \code{"UTF-8"} or as \code{"bytes"}.  These
  declarations can be read by \code{Encoding}, which will return a
  character vector of values \code{"latin1"}, \code{"UTF-8"},
  \code{"bytes"} or \code{"unknown"}.  They can also be set by the
  replacement form, in which case \code{value} is recycled as needed and
  any other values are silently treated as \code{"unknown"}.  ASCII
  strings will never be marked with a declared
  encoding, since their representation is the same in all supported
  encodings.  Strings marked as \code{"bytes"} are intended to be
  non-ASCII strings which should be manipulated as bytes, and never
  converted to a character encoding (so writing them to a text file is
  supported only by \code{writeLines(useBytes = TRUE)}).
  % non-bug report PR#16327

  \code{enc2native} and \code{enc2utf8} convert elements of character
  vectors to the native encoding or UTF-8 respectively, taking any
  marked encoding into account.  They are \link{primitive} functions,
  designed to do minimal copying.

  There are other ways for character strings to acquire a declared
  encoding apart from explicitly setting it (and these have changed as
  \R has evolved).  Functions \code{\link{scan}},
  \code{\link{read.table}}, \code{\link{readLines}}, and
  \code{\link{parse}} have an \code{encoding} argument that is used to
  declare encodings, \code{\link{iconv}} declares encodings from its
  \code{to} argument, and console input in suitable locales is also
  declared.  \code{\link{intToUtf8}} declares its output as
  \code{"UTF-8"}, and output text connections (see
  \code{\link{textConnection}}) are marked if running in a
  suitable locale.  Under some circumstances (see its help page)
  \code{\link{source}(encoding=)} will mark encodings of character
  strings it outputs.

  Most character manipulation functions will set the encoding on output
  strings if it was declared on the corresponding input.  These include
  \code{\link{chartr}}, \code{\link{strsplit}(useBytes = FALSE)},
  \code{\link{tolower}} and \code{\link{toupper}} as well as
  \code{\link{sub}(useBytes = FALSE)} and \code{\link{gsub}(useBytes =
  FALSE)}.  Note that such functions do not \emph{preserve} the
  encoding, but if they know the input encoding and that the string has
  been successfully re-encoded (to the current encoding or UTF-8), they
  mark the output.

  \code{\link{substr}} does preserve the encoding, and
  \code{\link{chartr}}, \code{\link{tolower}} and \code{\link{toupper}}
  preserve UTF-8 encoding on systems with Unicode wide characters.  With
  their \code{fixed} and \code{perl} options, \code{\link{strsplit}},
  \code{\link{sub}} and \code{gsub} will give a marked UTF-8 result if
  any of the inputs are UTF-8.

  \code{\link{paste}} and \code{\link{sprintf}} return elements marked
  as bytes if any of the corresponding inputs is marked as bytes, and
  otherwise marked as UTF-8 if any of the inputs is marked as UTF-8.

  \code{\link{match}}, \code{\link{pmatch}}, \code{\link{charmatch}},
  \code{\link{duplicated}} and \code{\link{unique}} all match in UTF-8
  if any of the elements are marked as UTF-8.
  
  There is some ambiguity as to what is meant by a \sQuote{Latin-1}
  locale, since some OSes (notably Windows) make use of character
  positions used for control characters in the ISO 8859-1 character set.
  How such characters are interpreted is system-dependent but as from \R
  3.5.0 they are if possible interpreted as per Windows codepage 1252
  (which Microsoft calls \sQuote{Windows Latin 1 (ANSI)}) when
  converting to e.g.\sspace{}UTF-8.
}
\value{
  A character vector.

  For \code{enc2utf8}, encodings are always marked; for
  \code{enc2native}, they are marked in UTF-8 and Latin-1 locales.
}
\examples{
## x is intended to be in latin1
x <- "fa\xE7ile"
Encoding(x)
Encoding(x) <- "latin1"
x
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
c(x, xx)
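
## A minimal sketch, not part of the original examples: enc2utf8() takes
## the declared encoding into account, and ASCII strings are never
## marked.  The names 'y', 'y8' and 'a' are illustrative only.
y <- "fa\xE7ile"
Encoding(y) <- "latin1"
y8 <- enc2utf8(y)
Encoding(y8)   # "UTF-8"
a <- "abc"
Encoding(a) <- "latin1"
Encoding(a)    # still "unknown", since 'a' is ASCII
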
Encoding(xx) <- "bytes"
xx # will be encoded in hex
cat("xx = ", xx, "\n", sep = "")
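
## A further sketch, not part of the original examples ('z', 'u' and 'b'
## are illustrative names): 'value' is recycled, unrecognized values are
## treated as "unknown", and paste() marks its result as described in
## the Details.
z <- c("fa\xE7ile", "caf\xE9")
Encoding(z) <- "latin1"             # recycled to both elements
Encoding(z)
u <- enc2utf8(z[1])
Encoding(paste(u, "abc"))           # "UTF-8"
b <- z[2]
Encoding(b) <- "bytes"
Encoding(paste(u, b))               # "bytes"
Encoding(z) <- "no-such-encoding"   # silently treated as "unknown"
Encoding(z)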
}
\keyword{utilities}
\keyword{character}
